DIABETES RISK PREDICTION MODEL (DRPM)

Building a Risk Prediction Model for Type II Diabetes Using Machine Learning Techniques

CONTEXT

Diabetes is one of the most prevalent chronic diseases in Africa and worldwide, affecting millions of people each year and exerting a significant financial burden on the continent’s economy. It is a chronic illness that impairs the body’s ability to regulate blood glucose levels, and it can shorten life expectancy and lower quality of life.

Complications of diabetes include heart disease, retinopathy (causing vision loss), diabetic foot (which can ultimately lead to amputation), and nephropathy (kidney disease). Diabetes cannot be cured, but many individuals can lessen its effects through healthy eating, regular exercise, weight loss, and medical care, especially when the disease is detected early. Predictive models for diabetes risk are therefore valuable resources in public health management and policy making, since early diagnosis enables lifestyle modification and more successful treatment.

The Centers for Disease Control and Prevention (CDC) estimates that 1 in 5 people with diabetes, and roughly 8 in 10 people with prediabetes, are unaware of their condition. While there are several types of diabetes, type II is the most common form, and its prevalence varies by age, education, income, location, race, and other social determinants of health, with much of the burden of the disease falling on those of lower socioeconomic status.

CONTENT

This dataset is the output of a Chinese research study conducted in 2016. It includes 1304 patients who tested positive for type II diabetes out of 4303 samples, with participant ages ranging from 21 to 99 years. The data were collected according to the indicators and standards of the World Health Organization. For this project, a CSV of the dataset available on Kaggle was used.
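For reference, the stated counts imply a positive-class prevalence of roughly 30% — a moderately imbalanced dataset, which is worth keeping in mind when evaluating the model later:

```r
# Positive-class prevalence implied by the dataset description
round(1304 / 4303, 3)  # ~0.303
```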

AIM

Develop a predictive model to identify individuals at high risk of developing type II diabetes.

OBJECTIVES

  1. Accurately predict the likelihood of diabetes onset, enabling early intervention.
  2. Highlight the most significant risk factors for diabetes.
  3. Analyze how social determinants of health influence diabetes risk.
  4. Use the insights gained from analysis and predictive modeling to inform public health strategies and policy decisions.

RESEARCH QUESTIONS

Can the survey responses and clinical investigations in the dataset accurately predict whether an individual has diabetes? Which risk factors are most predictive of diabetes risk?

FEATURES

  Age
  Gender
  BMI (Body Mass Index)
  SBP (Systolic Blood Pressure)
  DBP (Diastolic Blood Pressure)
  FPG (Fasting Plasma Glucose)
  FFPG (Final Fasting Plasma Glucose)
  Cholesterol
  Triglyceride
  HDL (High-Density Lipoprotein)
  LDL (Low-Density Lipoprotein)
  ALT (Alanine Aminotransferase)
  BUN (Blood Urea Nitrogen)
  CCR (Creatinine Clearance)
  Smoking Status (1: Current Smoker, 2: Ever Smoked, 3: Never Smoked)
  Drinking Status (1: Current Drinker, 2: Ever Drank, 3: Never Drank)
  Family History of Diabetes (1: Yes, 0: No)
  Diabetes (target variable; 1: Yes, 0: No)

INSTALL PACKAGES

  readr - for reading the CSV file
  dplyr - for data manipulation
  Hmisc - for data analysis, manipulation and statistical modeling
  ggplot2 - for visualization
  plotly - for creating interactive and engaging visualizations
  randomForest - for model building
  caret - for model evaluation
  pROC - for analyzing and visualizing the performance of scoring classifiers in binary classification tasks
  rpart - for creating recursive partitioning models and evaluating how much each feature contributes
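The packages above can be installed once with install.packages (a one-time setup step; package names exactly as listed):

```r
# One-time installation of all packages used in this report
install.packages(c("readr", "dplyr", "Hmisc", "ggplot2", "plotly",
                   "randomForest", "caret", "pROC", "rpart"))
```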

LOAD LIBRARY AND DATASET

Loading dataset using the readr library

library(readr)
## Warning: package 'readr' was built under R version 4.2.3
data <- read_csv("/Users/m1/Desktop/OGUNDIPE/PAU/Notes/Datasets/Diabetes Dataset.csv")
## Rows: 4303 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (18): Age, Gender, BMI, SBP, DBP, FPG, Chol, Tri, HDL, LDL, ALT, BUN, CC...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

DATA CLEANING AND PREPROCESSING

Summary of data, data manipulation and cleaning

Inspection of the dataset:
  - Finding missing values
  - Finding outliers
  - Other inconsistencies

Specifications and Quick glimpse of data

To get an overview of the structure and contents of the dataset.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
spec(data)
## cols(
##   Age = col_double(),
##   Gender = col_double(),
##   BMI = col_double(),
##   SBP = col_double(),
##   DBP = col_double(),
##   FPG = col_double(),
##   Chol = col_double(),
##   Tri = col_double(),
##   HDL = col_double(),
##   LDL = col_double(),
##   ALT = col_double(),
##   BUN = col_double(),
##   CCR = col_double(),
##   FFPG = col_double(),
##   smoking = col_double(),
##   drinking = col_double(),
##   family_histroy = col_double(),
##   Diabetes = col_double()
## )
glimpse(data)
## Rows: 4,303
## Columns: 18
## $ Age            <dbl> 26, 40, 40, 43, 36, 46, 52, 33, 42, 37, 51, 38, 50, 53,…
## $ Gender         <dbl> 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2…
## $ BMI            <dbl> 20.10, 17.70, 19.70, 23.10, 26.50, 20.50, 31.70, 22.90,…
## $ SBP            <dbl> 119, 97, 85, 111, 130, 88, 129, 129, 109, 128, 109, 95,…
## $ DBP            <dbl> 81, 54, 53, 71, 82, 63, 84, 92, 56, 75, 84, 58, 77, 86,…
## $ FPG            <dbl> 5.80, 4.60, 5.30, 4.50, 5.54, 5.76, 5.90, 5.17, 5.06, 4…
## $ Chol           <dbl> 4.36, 3.70, 5.87, 4.05, 6.69, 4.60, 6.14, 6.02, 4.73, 6…
## $ Tri            <dbl> 0.86, 1.02, 1.29, 0.74, 3.49, 1.00, 2.18, 3.90, 1.02, 0…
## $ HDL            <dbl> 0.90, 1.50, 1.75, 1.27, 0.91, 1.32, 1.15, 1.09, 1.15, 2…
## $ LDL            <dbl> 2.43, 2.04, 3.37, 2.60, 3.64, 2.78, 3.43, 3.12, 2.82, 4…
## $ ALT            <dbl> 12.0, 9.2, 10.1, 36.5, 69.3, 15.0, 26.0, 39.6, 11.5, 8.…
## $ BUN            <dbl> 5.40, 3.70, 4.10, 4.38, 3.86, 4.19, 4.70, 4.48, 2.98, 5…
## $ CCR            <dbl> 63.8, 70.3, 61.1, 73.4, 67.5, 59.0, 79.0, 68.3, 80.2, 6…
## $ FFPG           <dbl> 5.40, 4.10, 4.85, 5.30, 5.53, 4.80, 5.48, 5.84, 5.20, 5…
## $ smoking        <dbl> 3, 1, 3, 2, 3, 3, 1, 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 3, 3…
## $ drinking       <dbl> 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3…
## $ family_histroy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Diabetes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Find Missing Values

Using the describe() function from the Hmisc package to view missing-value counts and other summary statistics for each variable.

library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.2.3
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(data)
## data 
## 
##  18  Variables      4303  Observations
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0       70        1    48.09    16.73       28       30 
##      .25      .50      .75      .90      .95 
##       35       46       59       68       74 
## 
## lowest : 22 23 24 25 26, highest: 87 88 90 91 93
## --------------------------------------------------------------------------------
## Gender 
##        n  missing distinct     Info     Mean      Gmd 
##     4303        0        2    0.684    1.352   0.4561 
##                       
## Value          1     2
## Frequency   2790  1513
## Proportion 0.648 0.352
## --------------------------------------------------------------------------------
## BMI 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      275        1    24.12    3.812     18.8     19.8 
##      .25      .50      .75      .90      .95 
##     21.7     24.0     26.3     28.4     29.9 
## 
## lowest : 15.6 15.9 16.2 16.4 16.5, highest: 37   37.3 38.7 39.5 45.8
## --------------------------------------------------------------------------------
## SBP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      111        1    123.2    19.48     98.0    102.0 
##      .25      .50      .75      .90      .95 
##    111.0    122.0    134.0    146.0    154.9 
## 
## lowest :  72  77  83  85  86, highest: 191 197 198 199 200
## --------------------------------------------------------------------------------
## DBP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0       75    0.999    76.36    12.34       60       63 
##      .25      .50      .75      .90      .95 
##       69       76       83       90       95 
## 
## lowest :  45  47  48  49  50, highest: 116 120 126 131 134
## --------------------------------------------------------------------------------
## FPG 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      361        1    5.226   0.8816     4.01     4.26 
##      .25      .50      .75      .90      .95 
##     4.70     5.14     5.70     6.40     6.69 
## 
## lowest : 1.78 2.84 2.93 2.95 3   , highest: 6.95 6.96 6.97 6.98 6.99
## --------------------------------------------------------------------------------
## Chol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      486        1    4.861    1.026     3.50     3.76 
##      .25      .50      .75      .90      .95 
##     4.20     4.79     5.43     6.05     6.45 
## 
## lowest : 1.65  2.12  2.21  2.42  2.45 , highest: 8.98  9.28  9.56  9.8   11.65
## --------------------------------------------------------------------------------
## Tri 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      478        1    1.588      1.1    0.510    0.602 
##      .25      .50      .75      .90      .95 
##    0.860    1.280    1.940    2.830    3.559 
## 
## lowest : 0     0.03  0.12  0.18  0.21 , highest: 12.6  13.1  13.26 14.93 32.64
## --------------------------------------------------------------------------------
## HDL 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      188    0.999    1.667   0.8902    0.900    0.990 
##      .25      .50      .75      .90      .95 
##    1.130    1.340    1.610    2.364    4.861 
## 
## lowest : 0       0.45    0.48    0.52    0.56   
## highest: 2.85    2.98    3.29    3.87    4.86075
## --------------------------------------------------------------------------------
## LDL 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      384    0.999    2.976   0.9834    1.790    2.000 
##      .25      .50      .75      .90      .95 
##    2.350    2.820    3.420    4.708    4.861 
## 
## lowest : 0.54 0.55 0.69 0.88 1   , highest: 5.54 5.58 5.95 6.09 6.27
## --------------------------------------------------------------------------------
## ALT 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      680        1    26.75    18.97     9.30    11.00 
##      .25      .50      .75      .90      .95 
##    14.30    20.50    31.05    48.00    65.00 
## 
## lowest : 4.5     4.86075 5       5.2     5.8    
## highest: 198     245.9   264     279.2   436.2  
## --------------------------------------------------------------------------------
## BUN 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      576        1    4.841    1.367    3.021    3.390 
##      .25      .50      .75      .90      .95 
##    3.960    4.760    5.570    6.430    7.078 
## 
## lowest : 1.38  1.93  1.95  2     2.21 , highest: 11.42 12.79 13.31 14.64 17.73
## --------------------------------------------------------------------------------
## CCR 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      700        1    71.56     18.3     47.3     51.5 
##      .25      .50      .75      .90      .95 
##     60.0     72.0     82.3     91.7     97.0 
## 
## lowest : 4.86075 35.6    36.5    36.8    37     
## highest: 143.1   145     159.6   214.4   307    
## --------------------------------------------------------------------------------
## FFPG 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4303        0      479        1    5.729    1.294    4.490    4.630 
##      .25      .50      .75      .90      .95 
##    4.900    5.300    6.020    7.366    8.400 
## 
## lowest : 3.2  3.57 3.9  3.91 3.95, highest: 15.5 15.6 16.2 20.6 29.7
## --------------------------------------------------------------------------------
## smoking 
##        n  missing distinct     Info     Mean      Gmd 
##     4303        0        4    0.782    3.006    1.222 
##                                               
## Value      1.000000 2.000000 3.000000 4.860753
## Frequency       745      136     2534      888
## Proportion    0.173    0.032    0.589    0.206
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## drinking 
##        n  missing distinct     Info     Mean      Gmd 
##     4303        0        4    0.728     3.21   0.9092 
##                                               
## Value      1.000000 2.000000 3.000000 4.860753
## Frequency        83      583     2749      888
## Proportion    0.019    0.135    0.639    0.206
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## family_histroy 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     4303        0        2    0.173      265  0.06158   0.1156 
## 
## --------------------------------------------------------------------------------
## Diabetes 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     4303        0        2    0.633     1303   0.3028   0.4223 
## 
## --------------------------------------------------------------------------------
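describe() reports zero missing values for every variable. The same check can be made directly in base R; note, however, that the value 4.860753 recurs across several columns (smoking, drinking, HDL, ALT, CCR), which looks like an imputed placeholder rather than a genuinely observed measurement, so those entries may effectively be missing data in disguise:

```r
# Base-R missing-value check: NA count per column (all zeros here)
colSums(is.na(data))

# Count occurrences of the suspicious recurring value in each column
sapply(data, function(x) sum(abs(x - 4.860753) < 1e-4))
```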

Detect and remove Outliers in Major Features by Filtering

Outlier thresholds were chosen subjectively, based on clinically plausible values documented in previous patient presentations and reference texts.

par(mfrow = c(1, 4))
boxplot(data$BMI, main="BMI Boxplot")
boxplot(data$SBP, main="SBP Boxplot")
boxplot(data$DBP, main="DBP Boxplot")
boxplot(data$Chol, main="Chol Boxplot")

data_updated <- data[data$BMI < 50 & data$BMI > 15 & data$SBP < 240 & data$DBP > 50 & data$Chol < 10, ]
par(mfrow = c(1, 4))
boxplot(data_updated$BMI, main="BMI Boxplot")
boxplot(data_updated$SBP, main="SBP Boxplot")
boxplot(data_updated$DBP, main="DBP Boxplot")
boxplot(data_updated$Chol, main="Chol Boxplot")
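A quick count shows how much data the filter discards (the str() output later in the report confirms 4,293 rows remain):

```r
# Number of rows removed by the clinical-plausibility filter
nrow(data) - nrow(data_updated)  # 10 of 4303 rows dropped
```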

Feature Engineering

Mean Arterial Pressure (MAP) and BMI Categorization

MAP is the average arterial pressure throughout one cardiac cycle (systole and diastole). In individuals with diabetes it is significant because of the vascular complications associated with the condition. Diabetes can lead to changes in blood vessels, including arteriosclerosis (hardening of the arteries) and atherosclerosis (formation of plaques within the arterial walls), which can alter blood pressure levels and affect organ perfusion. Monitoring and managing MAP in individuals with diabetes is essential for reducing the risk of cardiovascular complications, preventing target organ damage, and managing co-morbid hypertension. Effective blood pressure control, as part of a comprehensive diabetes management plan, can significantly improve outcomes and quality of life for people with diabetes.
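The standard approximation used in the code below weights diastole twice as heavily as systole, since the heart spends roughly two-thirds of each cycle in diastole. A quick worked example:

```r
# MAP approximation: DBP + (SBP - DBP) / 3
sbp <- 120; dbp <- 80
dbp + (sbp - dbp) / 3  # ~93.3 mmHg for a textbook 120/80 reading
```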

BMI categorization classifies the continuous BMI feature into the standard WHO weight classes, which simplifies description of insights and visualizations.

data_updated$MAP <- with(data_updated, DBP + (1/3) * (SBP - DBP))
data_updated$BMI_Category <- cut(data_updated$BMI, 
                               breaks=c(-Inf, 18.5, 25, 30, Inf), 
                               labels=c("Underweight", "Normal", "Overweight", "Obese"))
table(data_updated$BMI_Category)
## 
## Underweight      Normal  Overweight       Obese 
##         160        2501        1428         204

Cross Check Data

Using the base-R str() function to take another quick look at the structure of the updated data.

str(data_updated)
## tibble [4,293 × 20] (S3: tbl_df/tbl/data.frame)
##  $ Age           : num [1:4293] 26 40 40 43 36 46 52 33 42 37 ...
##  $ Gender        : num [1:4293] 1 1 2 1 1 2 1 1 1 2 ...
##  $ BMI           : num [1:4293] 20.1 17.7 19.7 23.1 26.5 ...
##  $ SBP           : num [1:4293] 119 97 85 111 130 88 129 129 109 128 ...
##  $ DBP           : num [1:4293] 81 54 53 71 82 63 84 92 56 75 ...
##  $ FPG           : num [1:4293] 5.8 4.6 5.3 4.5 5.54 5.76 5.9 5.17 5.06 4.67 ...
##  $ Chol          : num [1:4293] 4.36 3.7 5.87 4.05 6.69 4.6 6.14 6.02 4.73 6.75 ...
##  $ Tri           : num [1:4293] 0.86 1.02 1.29 0.74 3.49 1 2.18 3.9 1.02 0.61 ...
##  $ HDL           : num [1:4293] 0.9 1.5 1.75 1.27 0.91 1.32 1.15 1.09 1.15 2.4 ...
##  $ LDL           : num [1:4293] 2.43 2.04 3.37 2.6 3.64 2.78 3.43 3.12 2.82 4.25 ...
##  $ ALT           : num [1:4293] 12 9.2 10.1 36.5 69.3 15 26 39.6 11.5 8.3 ...
##  $ BUN           : num [1:4293] 5.4 3.7 4.1 4.38 3.86 4.19 4.7 4.48 2.98 5.03 ...
##  $ CCR           : num [1:4293] 63.8 70.3 61.1 73.4 67.5 59 79 68.3 80.2 62.7 ...
##  $ FFPG          : num [1:4293] 5.4 4.1 4.85 5.3 5.53 4.8 5.48 5.84 5.2 5.19 ...
##  $ smoking       : num [1:4293] 3 1 3 2 3 3 1 3 3 3 ...
##  $ drinking      : num [1:4293] 3 1 3 3 3 3 3 3 3 3 ...
##  $ family_histroy: num [1:4293] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Diabetes      : num [1:4293] 0 0 0 0 0 0 0 0 0 0 ...
##  $ MAP           : num [1:4293] 93.7 68.3 63.7 84.3 98 ...
##  $ BMI_Category  : Factor w/ 4 levels "Underweight",..: 2 1 2 2 3 2 4 2 3 2 ...

Insights and Visualizations

Using the ggplot function of the ggplot2 library (plus plotly for interactivity) to create visualizations and draw insights from the data.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplot(data_updated, aes(x = BMI, fill = factor(Diabetes))) + 
  geom_histogram(binwidth = 1, color = "black") + 
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), name = "Diabetes") +
  ggtitle("BMI Distribution") +
  labs(x = "BMI", y = "Frequency") +
  theme_minimal()

ggplot(data_updated, aes(x = BMI_Category, fill = factor(Diabetes))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "BMI Category Count", x = "BMI Category", y = "Count") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red")) + 
  theme_minimal()

ggplot(data_updated, aes(x = Age, fill = factor(Diabetes))) + 
  geom_histogram(binwidth = 1, color = "black") + 
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), name = "Diabetes") +
  ggtitle("Age Distribution") +
  labs(x = "Age", y = "Frequency") +
  theme_minimal()

ggplot(data_updated, aes(x = SBP)) + 
  geom_histogram(binwidth = 10, boundary = 10, fill = "green", color = "black") + 
  ggtitle("Systolic Blood Pressure (SBP) Distribution") +
  labs(x = "SBP Range", y = "Frequency") +
  theme_minimal()

ggplot(data_updated, aes(x=DBP)) + 
  geom_histogram(binwidth=10, boundary = 10, fill="red", color="black") + 
  ggtitle("Diastolic Blood Pressure (DBP) Distribution")

ggplot(data_updated, aes(x = MAP, fill = factor(Diabetes))) + 
  geom_histogram(binwidth = 1, boundary = 0, color = "black") + 
  scale_fill_manual(values = c("0" = "beige", "1" = "red"), name = "Diabetes") +
  ggtitle("Mean Arterial Pressure (MAP) Distribution") +
  labs(x = "MAP Range", y = "Frequency") +
  theme_minimal()

ggplot(data_updated, aes(x = Chol, fill = factor(Diabetes))) + 
  geom_histogram(binwidth = 1, color = "black") + 
  scale_fill_manual(values = c("0" = "purple", "1" = "orange"), name = "Diabetes") +
  ggtitle("Cholesterol Distribution") +
  labs(x = "Cholesterol Range", y = "Frequency") +
  theme_minimal()

interactive_plot <- ggplot(data_updated, aes(x=Age, y=BMI, color=factor(Diabetes))) + 
  geom_point() + 
  ggtitle("BMI vs. Age by Diabetes Status")
ggplotly(interactive_plot) 
ggplot(data_updated, aes(x=Age, y=BMI_Category, color=factor(Diabetes))) + 
  geom_point() + 
  ggtitle("Diabetes by BMI Class and Age Distribution")

ggplot(data_updated, aes(x = as.factor(Diabetes), y = Age)) + 
  geom_boxplot(fill = "cyan", color = "black") +
  theme_minimal() +
  ggtitle("Age Distribution by Diabetes Status")

Converting Target Variable from numerical to factor

The Diabetes column indicates whether an individual has diabetes (0 or 1). Although stored as a numeric (double) column, as seen in the structure, it is essentially categorical, and treating category codes as double-precision floating-point numbers is not appropriate for classification. The column should therefore be treated as a factor (categorical variable). This improves clarity, ensures that statistical analyses and machine-learning functions handle it correctly, and would also help if the dataset later needs to be balanced.

data_updated$Diabetes <- as.factor(data_updated$Diabetes)
levels(data_updated$Diabetes) <- c("0", "1")

Cross Check Data

Using the View() function to take another quick look at the updated data and confirm it is clean and processed before building a model.

View(data_updated)

MACHINE LEARNING MODELLING

Load the caret, randomForest and pROC packages, then split the data into training and testing sets.

Train Random Forest Model

The model was created with the randomForest function, using Diabetes as the target variable and all other columns in the dataset as predictors. It consists of 500 trees, with three variables tried at each split.

Predict on the test data, evaluate with a confusion matrix, then compute further performance metrics (accuracy, precision, recall, and ROC/AUC).

library(caret)
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Split the data into training and testing sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(data_updated$Diabetes, p = .7, 
                                  list = FALSE, 
                                  times = 1)
diabetes_train <- data_updated[trainIndex, ]
diabetes_test <- data_updated[-trainIndex, ]

# Train the Random Forest model
rf_model <- randomForest(Diabetes ~ ., data = diabetes_train, ntree = 500, mtry = 3)

# Print the model summary
print(rf_model)
## 
## Call:
##  randomForest(formula = Diabetes ~ ., data = diabetes_train, ntree = 500,      mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 4.32%
## Confusion matrix:
##      0   1 class.error
## 0 2056  38  0.01814709
## 1   92 820  0.10087719
# Predict on the test data
predictions <- predict(rf_model, diabetes_test)

# Evaluate the model performance
confusionMatrix <- table(diabetes_test$Diabetes, predictions)
print(confusionMatrix)
##    predictions
##       0   1
##   0 875  22
##   1  36 354
# Calculate other performance metrics - Accuracy, Precision, Recall

accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
precision <- diag(confusionMatrix) / colSums(confusionMatrix)
recall <- diag(confusionMatrix) / rowSums(confusionMatrix)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.954933954933955"
print(paste("Precision:", precision))
## [1] "Precision: 0.960482985729967" "Precision: 0.941489361702128"
print(paste("Recall:", recall))
## [1] "Recall: 0.975473801560758" "Recall: 0.907692307692308"
#ROC
roc_response <- roc(diabetes_test$Diabetes, as.numeric(predictions))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_response)

auc(roc_response)
## Area under the curve: 0.9416
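The ROC curve above is built from hard class labels, which yields only a single operating point (its AUC reduces to the average of sensitivity and specificity, here 0.9416). A sketch using the forest's predicted class probabilities instead, assuming the same rf_model and diabetes_test objects from above, would trace a full curve:

```r
# ROC built from class probabilities rather than hard 0/1 predictions
prob_pred <- predict(rf_model, diabetes_test, type = "prob")[, "1"]
roc_prob <- roc(diabetes_test$Diabetes, prob_pred)
plot(roc_prob)
auc(roc_prob)
```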

Plot feature importance

Check each feature's contribution using a recursive partitioning (rpart) model.

library(rpart)
## Warning: package 'rpart' was built under R version 4.2.3
DRP_model <- rpart(Diabetes ~ ., data = data_updated)
DRP_model$variable.importance
##    drinking     smoking        FFPG         HDL         LDL         FPG 
## 1085.377738 1085.377738  513.035508  265.063559  218.469221  196.908898 
##         CCR         DBP         SBP 
##    7.351077    2.460003    2.460003
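The section heading promises a plot, but only the raw rpart importance vector is printed. A complementary sketch, reusing the random forest already fitted above, plots each feature's mean decrease in Gini impurity:

```r
# Variable importance from the fitted random forest (mean decrease in Gini)
importance(rf_model)
varImpPlot(rf_model, main = "Random Forest Variable Importance")
```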

DISCUSSION

Error Rate

The out-of-bag (OOB) estimate of the error rate is approximately 4.32%. The OOB error is an estimate of the model’s prediction error calculated using the out-of-bag samples. It provides an estimate of how well the model is likely to perform on unseen data. A lower OOB error indicates better predictive performance, with values closer to 0 indicating higher accuracy.

Confusion Matrix

The confusion matrix shows the model’s out-of-bag performance on the training data: true negatives (TN) = 2056, false positives (FP) = 38, false negatives (FN) = 92, true positives (TP) = 820. The class error is 0.018 (1.8%) for class 0 and 0.101 (10.1%) for class 1.
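The class errors follow directly from the matrix counts:

```r
# Out-of-bag confusion-matrix counts from the model summary
tn <- 2056; fp <- 38; fn <- 92; tp <- 820
fp / (tn + fp)  # class-0 error: ~0.018
fn / (fn + tp)  # class-1 error: ~0.101
```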

Performance Metrics

The accuracy of the model on the test data is approximately 95.49%. Precision is 96.05% for class 0 (no diabetes) and 94.15% for class 1 (diabetes); recall (sensitivity) is 97.55% for class 0 and 90.77% for class 1. The area under the curve (AUC) measures the model’s ability to distinguish between classes; an AUC of 0.9416 indicates good predictive performance.
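Each reported figure can be reproduced by hand from the test-set confusion matrix printed earlier (rows are actual classes, columns are predictions):

```r
# Test-set confusion-matrix counts
tn <- 875; fp <- 22; fn <- 36; tp <- 354
(tn + tp) / (tn + fp + fn + tp)  # accuracy ~0.9549
tp / (tp + fp)                   # precision (class 1) ~0.9415
tp / (tp + fn)                   # recall    (class 1) ~0.9077
```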

Variable Importance

The analysis also provides variable importance measures for each predictor used in the model; the higher the value, the more important the variable is in predicting the outcome. The most important variables are drinking, smoking, and FFPG (final fasting plasma glucose). Note that drinking and smoking share an identical importance score, which suggests the two variables act as surrogates for one another in the tree and should be interpreted with caution.

Overall, the random forest model demonstrates strong performance in classifying diabetes based on the provided predictors, with high accuracy, precision, recall, and AUC. Additionally, the variable importance measures provided insights into which features are most influential in predicting diabetes.